WORKSHOP 1
February 24 and 25, 1987


DAY  1 


TIME  EVENT 


BALLROOM  "A" 

8:00  Welcome  from  Douglas  Aircraft  Engineering. 
(Peterson)

8:10  Welcome from USAF/FAA sponsors.
(Britten-Austin and Hwoschinsky)

8:20  Objectives of workshop.
(Biferno)

8:30  What is workload certification?
(Gabriel)

9:00  Workload  Assessment  and  certification. 
(Fadden) 


9:30  Methodological  issues  concerning  workload  measurement  during 
certification. 

(Biferno,  Boucek) 

10:15  Break 

10:30  Subjective  workload  measurement  panel:  A  review  of  the 
evidence  regarding  validity  and  reliability. 

Reviewer  Hart 

Reviewer  Reid 

Discussant  Gopher 

12:00  Lunch  -  LOUNGE 

1:00  Performance  workload  measurement  panel:  A  review  of  the 
evidence  regarding  validity  and  reliability. 

Reviewer  Wickens 

Reviewer  Eggemeier 

Discussant  McCloy 

2:30  Break 


2:45  Physiological workload measurement panel: A review of the
evidence regarding validity and reliability.

Reviewer Kramer

Reviewer Wilson

Discussant Stern

4:15  Review of criteria for FACT MATRICES categorization.
(Williams) 

4:30  Review  of  participant  Fact  Matrices  by  workload  panel. 

Three  subgroups  will  be  created  with  panel  members  as  leaders. 

Subjective  subgroup  in  room  #  106 

Performance  subgroup  in  room  #  136  (Wednesday  in  room  #  121) 
Physiological  subgroup  in  room  #151 

5:30  Adjourn. 

7:00  Banquet  -  MONTEGO  BAY  ROOM 

DAY  2 

TIME  EVENT 

8:00  Continue  review  of  Fact  Matrices  by  workload  panels  in 
subgroups.  Cite  evidence  for  additions  or  deletions  to 
original  matrix. 

12:00  Lunch  -  LOUNGE 

BALLROOM "A"

1:30  Review Fact Matrices of Subjective Measures.

2:00  Review Fact Matrices of Performance Measures.

2:30  Review Fact Matrices of Physiological Measures.

3:00  Break 

3:15  Briefing on how measures will be implemented in simulation
scenario.
(Corwin, Sandry-Garza)

4:15  Concluding  remarks. 

(Boucek,  Biferno) 

4:30  Survey  of  attendees  regarding  the  best  workload  measures. 

4:45  Turn in survey and final version of Fact Matrices
-  MONTEGO BAY ROOM

OBJECTIVES

OBJECTIVES OF THE WORKSHOP

GATHER  INFORMATION  FROM  WORKLOAD  EXPERTS  REGARDING  WHICH 
MEASURES  HAVE  EVIDENCE  SUPPORTING  THEIR  RELIABILITY  OR 
VALIDITY. 


OBJECTIVES  OF  PANEL  DISCUSSIONS 

PROVIDE  AN  INDEPENDENT  REVIEW  OF  THE  FACTS  CONCERNING  THE 
VALIDITY  AND  RELIABILITY  OF  WORKLOAD  MEASURES. 


OBJECTIVE  OF  THE  SUBGROUP  SESSIONS 

PROVIDE  A  MEANS  FOR  SYSTEMATICALLY  REVIEWING  AND  MODIFYING  THE 
FACT MATRICES.

PHYSIOLOGICAL
(In room #151)

Dr.  Peter  Hancock 

Ms.  Kathleen  Hayward  (assistant) 

Dr. Arthur F. Kramer (moderator)

PERFORMANCE
(Tuesday  in  room  #136) 

(Wednesday  in  room  #121) 

Ms.  Janet  Barbato  (assistant) 

Mr.  Michael  R.  Bortolussi 
Dr.  George  Boucek 
Dr. William Corwin
Dr. Michael Dresel

Dr. F. Thomas Eggemeier (moderator)

WORKLOAD FACT MATRIX REVIEW
PANEL AGENDA
(suggested)

TIME       ITEM  ITEM DESCRIPTION

1630-1730  1.    CRITERIA FOR FACT MATRIX REVIEW

                 a.  Reliability

                 b.  Validity

                 c.  Empirical data

                 d.  Capability for flight use

                 e.  Rejection/Addition rationale and support

0800-1030  2.    REVIEW OF EACH MEASURE ON FACT MATRIX

                 Process:

                 1)  Pick workload measure for discussion

                 2)  Identify empirical material which
                     provides evidence on validity and
                     reliability of measure

                 3)  Solicit change recommendations

                 4)  Review evidence for change
                     recommendations

                 5)  Speaker Panel forms recommendation

                 6)  Repeat process for next measure

1030-1130  3.    REVIEW MEASURES NOT PREVIOUSLY INCLUDED IN FACT
                 MATRICES

                 Process:

                 1)  Identify a new measure for inclusion

                 2)  Review evidence for addition of measure

                 3)  Speaker Panel forms recommendation

                 4)  Repeat process until no additional
                     measures are identified

1130-1200  4.    FORMULATE SUMMARY REPORT

                 Summarize:

                 additional reference items
                 delete reference items
                 additional measures




WORKLOAD  FACT  MATRIX  REVIEW 


The  purpose  of  the  Subgroup  reviews  is  to  modify  the  FACT  MATRICES  in 
a  systematic  and  orderly  fashion.  The  FACT  MATRICES  are  basically 
locators  for  Reliability  and  Validity  information  for  each  Workload 
measurement type. We want you to add reference works to the FACT
MATRICES if they contain Reliability and Validity information. We
are  not  asking  your  panel  to  judge  the  quality  of  the  information, 
but  determine  whether  the  material  addresses  the  measure's 
Reliability  or  Validity.  The  evidence  can  be  weak  and  still  be 
acceptable  for  inclusion  in  the  FACT  MATRIX.  On  the  other  hand,  if 
we  entered  references  into  the  FACT  MATRICES  which  are  not 
appropriate  or  correct,  we  ask  you  to  delete  those  items.  Regardless 
of  the  modifications  recommended  by  your  subgroup,  evidence  must  be 
given for the addition or deletion of items. It is up to the Speaker
Panel to decide on the acceptability of the evidence and on the
decision to add or delete items, to lead the subgroup sessions, and to
direct the decision process that the group employs to modify the FACT
MATRICES.

We ask that each subgroup first apply the Anastasi and Guilford
definitions of Validity and Reliability to the workload literature.

We  are  aware  that  the  workload  field  has  not  generally  focused  its 
resources  on  demonstrations  of  reliability  because  there  is 
disagreement  on  the  definitions  and  content  areas  of  the  workload 
construct.  Therefore,  after  addressing  the  literature  using  the 
Anastasi and Guilford criteria, you are free to employ the
validity and reliability definitions of your choice to support the
contention that a particular measure is valid and reliable. We only
ask that you state these definitions explicitly in your
justification for that measure. Remember that the studies suitable for
supporting validity or reliability must be empirical, not review or
theory articles.


VALIDITY  DEFINITIONS 


1.  FACE VALIDITY: "... pertains to whether the test 'looks valid'
to the subjects who take it, the administrative personnel who decide
on its use, and other technically untrained observers."

IMPORTANCE: "Certainly if a test appears irrelevant, inappropriate,
silly, or childish, the result will be poor cooperation, regardless
of the actual validity of the test." ... "For example, if a test of
simple arithmetic reasoning is constructed for use with machinists,
the items should be worded in terms of machine operations rather
than in terms of 'how many oranges can be purchased for 36 cents' or
other traditional schoolbook problems."

REFERENCE:  Anastasi,  A.  (1968).  Psychological  Testing.  3rd 
Edition,  Macmillan,  London,  p.  104. 

2.  CONTENT VALIDITY: "... involves essentially the systematic
examination of the test content to determine whether it covers a
representative sample of the behavior domain to be measured." ...
"The content area to be tested must be systematically analysed to
make certain that all major aspects are adequately covered by the
test items, and in the correct proportions."

IMPORTANCE: "... content validity depends on the relevance of the
individual's test responses to the behavior area under
consideration, rather than on the apparent relevance of item
content."

REFERENCE:  Anastasi,  A.  (1968).  Psychological  Testing.  3rd 
Edition,  Macmillan,  London,  p.  100. 

3.  CONSTRUCT VALIDITY: "... is the extent to which the test may be
said to measure a theoretical construct or trait." ... "requires the
gradual accumulation of information from a variety of sources. Any
data throwing light on the nature of the trait under consideration
and the conditions affecting its development and manifestations are
grist for this validity mill."

IMPORTANCE: "... construct validity is a comprehensive concept,
which includes the other types. All specific techniques for
establishing content and criterion-related validity ... could be
listed again under construct validity."

REFERENCE: Anastasi, A. (1968). Psychological Testing. 3rd
Edition, Macmillan, London, p. 114-5, 121.

4.  PREDICTIVE VALIDITY (CRITERION-RELATED VALIDITY): "...
indicates the effectiveness of a test in predicting an individual's
behavior in specific situations."

IMPORTANCE: "For this purpose, performance on the test is checked
against a criterion, i.e., a direct and independent measure of that
which the test is designed to predict."

REFERENCE: Anastasi, A. (1968). Psychological Testing. 3rd
Edition, Macmillan, London, p. 105.

RELIABILITY DEFINITIONS


1.  TEST-RETEST RELIABILITY: "The key concept for this
(test-retest) procedure is that of stability. It answers the
question concerning how stable or dependable are the measurements
over a period of time."

IMPORTANCE: "High reliability of this kind tells us that the
individuals remain rather uniform, or maintain their rank
positions in spite of changes, in whatever psychological functions
this test measures."

REFERENCE: Guilford, J. P. (1954). Psychometric Methods, 2nd
Edition, McGraw Hill, New York, p. 374.
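
As a minimal sketch of the stability idea, assuming the test-retest
coefficient is taken as the Pearson correlation between two
administrations of the same measure. The function name and the
ratings below are hypothetical and are not from the workshop material.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical workload ratings from the same five pilots on two occasions.
session_1 = [30, 45, 50, 65, 80]
session_2 = [35, 40, 55, 60, 85]

# A value near 1 means the pilots kept roughly the same rank positions.
test_retest_r = pearson_r(session_1, session_2)
print(round(test_retest_r, 3))
```

A high coefficient here says only that rank order was stable across
the two sessions; it says nothing about whether the measure is valid.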

2.  SPLIT HALF RELIABILITY: "The information sought (from
fractionation of a test into two or more parts) concerns the
equivalence of parts for measurement purposes, or the internal
consistency of the test."

IMPORTANCE: "In the case of the split-half method, the
Spearman-Brown formula has usually been applied to estimate the
reliability of the test of full length from the obtained estimate
of correlation of a test of half length."

REFERENCE: Guilford, J. P. (1954). Psychometric Methods, 2nd
Edition, McGraw Hill, New York, p. 373-
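
The Spearman-Brown step mentioned above can be sketched as follows.
This is an illustration only: the function name and the half-test
correlation of 0.60 are assumptions, and the formula is the standard
one, r_full = n * r / (1 + (n - 1) * r), with n = 2 for two halves.

```python
def spearman_brown(r_part, length_factor=2.0):
    """Estimate the reliability of a test lengthened by `length_factor`.

    r_part is the correlation between the parts (for example, two halves).
    With length_factor=2 this is the usual split-half correction.
    """
    denominator = 1.0 + (length_factor - 1.0) * r_part
    if denominator <= 0:
        raise ValueError("correlation too negative for this length factor")
    return length_factor * r_part / denominator


# Hypothetical correlation between odd-item and even-item half scores.
half_test_r = 0.60
full_test_r = spearman_brown(half_test_r)
print(round(full_test_r, 2))
```

The corrected value is always larger in magnitude than the half-test
correlation, which is why the uncorrected figure understates the
reliability of the full-length test.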

3.  ALTERNATE FORMS RELIABILITY: "... method bears resemblances to
both the internal consistency approach and the retest approach,
the end result is an index of how equivalent the
psychological-measurement content of one form of the test is with
the content of another."

IMPORTANCE: "... the alternate-forms method indicates both the
equivalence of content and stability of performance." ... "Some
investigators prefer the alternate-forms type to the internal
consistency type of coefficient for the reason that they are
interested in how much stability to expect of scores over time."

REFERENCE: Guilford, J. P. (1954). Psychometric Methods, 2nd
Edition, McGraw Hill, New York, p. 374-5.

4.  INTER-RATER RELIABILITY: "... rater intercorrelations, which
indicate the internal consistency among raters. Such correlations
have usually been regarded as indices of rating reliability ..."

IMPORTANCE: If raters agree, that is, demonstrate high
intercorrelations, then the number of raters required to generate a
significant result can be reduced.

REFERENCE: Guilford, J. P. (1954). Psychometric Methods, 2nd
Edition, McGraw Hill, New York, p. 286-7.
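
One way to see why high intercorrelations reduce the number of raters
needed is sketched below. It assumes, as an illustration not stated in
the handout, that the reliability of pooled ratings follows the
Spearman-Brown relation r_k = k * r / (1 + (k - 1) * r). The function
names and the example values are hypothetical.

```python
from math import ceil


def pooled_reliability(mean_r, n_raters):
    """Reliability of the average of n_raters ratings (Spearman-Brown)."""
    return n_raters * mean_r / (1.0 + (n_raters - 1) * mean_r)


def raters_needed(mean_r, target):
    """Smallest number of raters whose pooled ratings reach `target`."""
    if not (0.0 < mean_r < 1.0 and 0.0 < target < 1.0):
        raise ValueError("mean_r and target must lie strictly between 0 and 1")
    exact = target * (1.0 - mean_r) / (mean_r * (1.0 - target))
    # Small tolerance so floating-point noise does not add an extra rater.
    return max(1, ceil(exact - 1e-9))


# Hypothetical mean intercorrelations among raters, target reliability 0.90.
print(raters_needed(0.5, 0.9))  # moderate agreement
print(raters_needed(0.8, 0.9))  # high agreement
```

Under this assumption, moving from a mean intercorrelation of 0.5 to
0.8 cuts the panel needed for a pooled reliability of 0.90 from nine
raters to three.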
CHECK

FAR RELEVANT TO CREW WORKLOAD

APPENDIX D - SUMMARY

DEFINITIONS

"Mental workload may be viewed as the difference between
capacities of the information-processing system that are
required for task performance to satisfy performance
expectations and the capacity that is available at any
given time." Gopher and Donchin, 1986.

"The construct of spare capacity, derived from models of
attention, is the most important component of mental
workload...

"However, mental workload is more than just spare
capacity. Additional aspects of mental workload include
subjective feelings, effort, individual differences,
strategy and practice." Kantowitz, 1986.

"Workload is fundamentally defined in terms of this
relation between resource supply and task demand."
Wickens, 1984.

"Workload assessment techniques are principally designed
to measure the degree of operator processing capacity
which is expended in performing a particular task or
system function." Eggemeier, Shingledecker, and
Crabtree, 1985.

"...the definition of workload in terms of the attention
required by a task, or the additional capacity yet
remaining to perform another task, with possible
reference to the intensity of mental or physical effort
exerted." Hart and Sheridan, 1984.

"I don't know!" "Nobody else knows either." Kantowitz.

"There is no agreed-upon definition of mental workload
and no agreement on how to measure it." Moray, 1982.
Figure. Pooled error performance and excess control
capacity associated with the cross-coupled task disguised
as the spiral divergence of a simulated C-5H STOL aircraft.
The plan and elevation of the intended approach profile are also
shown. The aircraft was controlled manually by a pilot
using raw situation data under instrument flight rules
without any flight director. Each symbol represents the
time-averaged mean value and plus or minus one root-mean-square
value between the numbered waypoints on the approach
profile. The adaptive spiral divergence was cross-coupled
to a weighted linear combination of the three error
performance measures shown, and the weighted error reference
was ten per cent greater than the pilot's own baseline
error performance without the cross-coupled adaptive
loading task. The maximum possible value of the excess
control capacity measurement was limited at 0.15. This
maximum value was reached between waypoints 3 and 7.

(From Clement, ref. 36)

INTRUSIVENESS: ASSUMED TO BE NONINTRUSIVE


REPRESENTATIVE  APPLICATIONS  OF  PRIMARY  TASK  MEASURES 
IN  AVIATION  OR  RELATED  ENVIRONMENTS 


NORTH, STACKHOUSE, & GRAFFUNDER (1979)

SOME DATA ON SINGLE-TASK VERSIONS OF

REPRESENTATIVE APPLICATIONS OF SECONDARY TASK

COMMUNICATION TASK PERFORMANCE

Subjective Group Meeting - 2/25/87

Synopsis:  NASA  TLX  and  SWAT  were  the  measurement  techniques 
recommended  for  use  in  the  Part  Task  testing.  There  are 
enough  references  and  data  collections  using  these  measures 
to  demonstrate  their  validity  and  reliability.  The  Modified 
Cooper-Harper, a unidimensional scale, was discussed but not
recommended, as it gives a "number" but does not allow you to
work backward to how the number was developed and thus
offers no diagnosticity.

Meeting  Summary 

Danny  Gopher  said  the  goal  of  the  meeting  was  to  come  up 
with  recommendations  for  workload  assessment  methodology  in 
certification  of  aircraft.  We  must  review  the  problem  of 
certification  as  well  as  the  process  and  identify  key  issues 
in  certification  where  our  methodology  can  help. 

Del Fadden explained the Boeing certification process and
showed the Pilot Subjective Evaluation (PSE) form that is
used by Boeing during the certification program. In this
program the pilot is asked to compare the new aircraft with
a known reference aircraft. If a pilot says a task is more
difficult, then he must discuss why it is more difficult.

Del  pointed  out  that  before  certification,  in  simulation, 
most  of  the  cockpit  design  work  has  been  part  task.  A  lot 
of  design  is  done  even  before  the  simulator  is  available. 

An  airplane  evolves  -  other  planes  are  used  for  reference. 

Jean-Jacques Speyer said Airbus takes a unidimensional view.
They ask for many ratings, and in fact take ratings
continuously on-line. He proposed that all ratings together
give a microscopic evaluation of what is going on. The
Air Worthiness observer also gives a subjective rating. If
the observer sees that the pilot has not made a rating at
some particular time, he can ask the pilot to give a rating
at that point. (Instrument setup: the observer gives the crew
member a light, and the pilot is invited to give a rating at
that time.) Ratings are not pass-fail; they are given on an
absolute scale.

Group discussion followed on the two techniques. In
summation, Boeing's method is a comparative measure; Airbus
uses an absolute technique. Boeing takes pilot ratings
post-flight; Airbus takes ratings in-flight, and at the end of
the flight pilots are given a long questionnaire of 150 questions.

Gopher said we must define the problem. Why are subjective
ratings important? What does the pilot feel about the task he
must perform? Is he comfortable with it? All participants
agree that they use pilot subjective opinion in one form or
another. Gopher felt it would be useful if we could
standardize and use the same scale in all nations so we can
learn from each other.

In the discussion that followed on standardization the
following points were made:

1.  We  need  experienced  pilots  to  make  judgements. 

2.  We  need  to  have  some  comparative  point  of  reference. 

3.  Ratings should be made under some specified optimum
conditions.

4.  Raters  must  be  trained  to  rate. 

5.  Everyone  must  agree  on  the  dimensions  and  terms  of 
workload  so  results  will  be  communicable. 

The  point  was  made  that  if  the  same  method  was  used  in 
design  and  certification,  then  there  would  be  no  surprises, 
although  the  method  would  be  used  in  certification  in  a  more 
limited  sense.  Del  Fadden  pointed  out  that  this  is  not 
always  possible,  for  example,  for  the  design  of  much  of  the 
7J7  there  is  no  reference  airplane. 

Danny Gopher's opinion on the selection of rating scales is to
use NASA TLX and SWAT. He said they are the most viable and
best documented. He said the Modified Cooper-Harper is
"dead": it is not being used and is based on old research. He
would like to see both NASA TLX and SWAT used, at least to
start.

Modified  Cooper  Harper 

Sandy Hart looked at the Modified Cooper-Harper when she
began to develop a scale. In this scale the raters move
through a series of decisions; these decisions determine
whether the rater ends up at the top, middle, or bottom of
the scale. You can't take the final number and work
backwards to determine how the number was developed. NASA
wanted a diagnostic scale that would allow them to work
backwards.

NASA  TLX 

Subjects were asked to give their definition of workload
before they started the tests to reduce intersubject
variability. During the tests subjects were asked what they
felt caused the primary source of workload; that task
was then given more weight. This works well for simple tasks,
but weights don't have as much effect in complex tasks. It
was felt that weights should be used for both simple and
complex tasks, however.


It was pointed out that the FAA is interested in the 6
workload functions listed in the FARs, so perhaps a tool
should be developed to evaluate these 6 functions. These
functions are: flight path control, collision avoidance,
navigation, communication, operation and monitoring of
aircraft engines and systems, and command decisions. It was
suggested that the FAA could be asked to weight these
functions so the manufacturers would have a standard to work
against.
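
The weighting idea discussed for NASA TLX can be sketched as a
weighted average of subscale ratings. This is an illustration under
assumptions: the six subscale names and the convention that weights
come from 15 paired comparisons are taken from the commonly described
TLX procedure, not from these minutes, and the ratings and weights
below are made up.

```python
SUBSCALES = (
    "mental demand",
    "physical demand",
    "temporal demand",
    "performance",
    "effort",
    "frustration",
)


def weighted_workload(ratings, weights):
    """Weighted mean of subscale ratings; weights need not be normalized."""
    if set(ratings) != set(SUBSCALES) or set(weights) != set(SUBSCALES):
        raise ValueError("ratings and weights must cover every subscale")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(ratings[s] * weights[s] for s in SUBSCALES) / total_weight


# Hypothetical ratings (0-100) and paired-comparison weights (sum to 15).
ratings = {"mental demand": 70, "physical demand": 20, "temporal demand": 60,
           "performance": 40, "effort": 50, "frustration": 30}
weights = {"mental demand": 5, "physical demand": 0, "temporal demand": 4,
           "performance": 2, "effort": 3, "frustration": 1}

weighted_score = weighted_workload(ratings, weights)
unweighted_score = sum(ratings.values()) / len(ratings)
print(round(weighted_score, 1), round(unweighted_score, 1))
```

The same function would apply if the six FAR workload functions were
used as the dimensions and a regulator supplied the weights, as was
suggested in the discussion.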

SWAT 

Gary  Reid  said  he  and  his  colleagues  did  not  set  out  to 
develop  a  subjective  scale.  They  were  asked  as  workload 
experts  how  to  measure  workload. 

They  started  out  with  the  Simpson-Sheridan  scale.  The 
disadvantage  of  this  scale  is  that  it  takes  a  lot  of 
training  to  properly  use  its  decision  tree  rating  scale. 

Next they took a three-dimensional scale, regression
approach, looking for the smallest set of dimensions that
would tell them the most. They were driven by practical
considerations such as cost. For example, is increased
sensitivity worth the cost? This scale fits their needs.

They can gather ratings with some diagnosticity. In their
ratings, they feel that if they hadn't had observers, they
couldn't tell why workload is higher in some instances.

SWAT covers 4 of the 6 functions listed in the FARs.
Performance and physical demand are not covered. Gary said
that if they need performance they measure it. Measuring
mental workload is important. Physical is not as important
as it once was.

Discussion followed on the importance of getting down to a
"one number" assessment: is the extra effort to get there
really necessary? Some felt that "one number" provides
structure and information and is a valuable adjunct to what
is going on.

Danny  Gopher  said  that  the  charge  of  the  subjective  group 
was  to  evaluate  measures  for  use  in  the  simulation  study. 

It has been concluded that it is important to use subjective
measures. He reiterated that all manufacturers use
subjective measures in one form or another. This group
strongly recommends the use of subjective measures. It is
reasonable and advisable to find a systematic way to elicit
these opinions. He feels that standardization is one
approach. The question is how to get the information from the
pilots.

Both NASA TLX and SWAT give global assessments of tasks as a
whole. There are enough references and data collections
using them to provide valid and reliable results. The
Airbus on-line measurement model is different because one
must design situations so a single number will be meaningful
and one must have many such numbers. It is a costly and
demanding process and is not well documented. Gopher would
like Airbus to try NASA TLX and SWAT along with their method
to see if the methods agree.

Discussion followed on the importance of using both NASA TLX
and SWAT. The following points were made:

1.  The way in which you communicate with the pilot is
different.

2.  There are differences in the weighting systems of the two
scales.

3.  It is important to generate more than just a single
measure of workload.

The concept was discussed that assessment is a package, not
a scale. Raters must be qualified and pilots must be
trained. NASA TLX and SWAT are both used in flight as
often as possible. Ratings are taken as soon as possible.
Both Sandy Hart and Gary Reid stated that if data are
collected in flight, they are more comfortable with them.
Results of tests, however, do not back up feelings that
in-flight measurement is better. Gopher felt that if we can
get data post-flight with no change in results, that would
be an important finding.

Much discussion followed on what number is good enough.
Should some subjective measures be taken in-flight and some
post-flight? If the measures are taken in-flight, are they
intrusive? It was the feeling of the group that measures
don't always have to be intrusive; for example, after
landing you can gather ratings on outer, middle, inner
marker and landing.

Danny Gopher said he would summarise the subjective group
meeting results in the general meeting for 15 minutes, then
allow Sandy Hart and Gary Reid to give summaries since they
don't agree with Danny on many points. Sandy and Gary said
they gave their presentations yesterday. Sandy stated that
Gopher's opinion does not accurately reflect that of the
group.

MINUTES


Performance  Group  Meeting  -  2/25/87 


Summary


The  performance  group  discussed  the  cost  of  measuring  workload  for 
behavioral,  physiological,  and  subjective  measures.  Much  support  was 
given  for  including  a  Sternberg  Task  in  the  simulation  test  battery. 


Several valid and reliable measures and implementation requirements
were agreed upon. The measures suggested were: Sternberg task,
critical tracking, choice reaction time, mental arithmetic, and
embedded communication.


The  Cost  of  Measuring  Workload  - 

Performance measures (procedural errors): Measure impairments to
performance.

Physiological measures: Measure psychosomatic effects of stress and
occupational diseases.

Subjective measures: Measure conscious experience, estimate the
ability to cope with goals and achieve criteria, and are sensitive to
one general "work of intentions". This influences performance on the
very general level. The costs or possible effects of subjective
measures are that misjudgements may affect the selection of goals and
criteria, motivation, and risk-taking behavior.


Types  of  Certification  - 

The  group  established  that  certification  varies  depending  on  what 
organization  is  doing  the  certification. 

British: Look at the relationship between subjective, physiological,
and behavioral measures. A study by Jean-Jacques Speyer was mentioned,
testing the possibility of using control reversals during approach as a
measure.

Boeing: Weather is a random event. A simulator can impose whatever
weather is desired. Certification uses a performance margin. Failures,
weather, et cetera are added. Frequency of occurrence can be
manipulated. Performance margin can be large or small. Uses
traditional timeline analysis.


Implementation  Requirements  for  User  Acceptance  - 

The group developed the following requirements to ensure user
acceptance:

1  -  The measure must be nonintrusive.

2  -  The measure must conform to all safety standards.

3  -  The tasks must be within the realm of "normal" methods.

4  -  The measure must not lower crew self-image.

5  -  The measure must be non-career-threatening.


Valid  and  Reliable  Measures  - 

The group suggested and agreed upon the value of the following
measures:

1  -  Sternberg (auditory/visual) can be highly intrusive. Must make
efforts to move toward "real" or "normal" tasks.

2  -  Critical tracking (psychomotor) can be too complicated. This
measure was considered of borderline acceptability, but the group
voted to keep it as it can be useful.

3  -  Choice reaction time (visual).

4  -  Mental arithmetic (auditory/visual) allows for much
flexibility. Can be embedded and made "realistic" to piloting tasks.

5  -  Embedded communications (auditory): "normal" secondary tasks.


Problems  and  Considerations  - 

The  group  raised  the  following  problems  and  considerations: 

1  -  There may be a problem with comparing a new craft's performance
with a "reliable and safe" old craft's performance. The comparison
will not always be reliable. New craft are often too different from
old craft. This can yield misleading results.

2  -  Must consider that the design stage employs engineering test
pilots, the testing stage employs line pilots, and the
certification stage employs engineering test pilots.

3  -  Theory must guide implementation.

4  -  Must consider the training on, and difficulty of, tasks (data
vs. resource limit).

5  -  Must consider fatigue from introducing the tasks in terms of
adding energy expenditure to the crew, especially at work underload.


Sternberg  Task  as  a  Measure  - 

The  group  cited  several  workload  measurement  studies  looking  at 
handling,  displays,  and  crew  coordination  aspects  of  flight.  All 
studies  used  a  Sternberg  task. 

1  -  1982. This study evaluated 2 HUD display formats. The measure
used was a visual Sternberg task. Pilots flew ILS. While the
experiment was in progress, the experimenters recorded several verbal
responses such as whistling and short comments about the task.

2  -  This  study  varied  frequency  of  input  during  a  terrain  following 
flight  director  profile.  An  adaptive  visual  and  auditory  Sternberg 
task  was  used.  A  top  level  of  error  was  preset.  The  study  found  that 
the  visual  Sternberg  task  discriminated  between  different  levels  of 
workload  while  the  auditory  did  not. 

3  -  This study varied visual disorientation with a Malcolm horizon.
The  measure  used  was  a  Sternberg  letter  presentation  task.  Results 
were  mixed. 

4  -  Dunn, 1985. Helicopter Proceedings Conference. This study
tested kinesthetic displays. Pilots "flew by the feel" of the
instruments.

5  -  This  study  tested  a  cross  coupled  instability  tracking  system. 
The  measure  used  was  a  Sternberg  variable  pitch  task  where  the 
instability  level  was  varied. 

After  reviewing  these  studies,  the  group  agreed  the  Sternberg  task 
should  be  among  the  measures  tested  during  simulation. 


Questions  Raised  by  the  Group  - 

1 - How to address user popularity? Aspects of a potential measure
may appeal to large airline manufacturers, but not to small general
aviation manufacturers, or vice versa.

2  -  How  to  assess  dimensionality?  What  is  being  measured? 

3  -  How  to  address  the  demonstration  of  worst  case?  What  are  the 
probabilities? 

4  -  How  to  assess  practicality?  We  must  consider  cost,  degree  of 
simulation,  and  flight  testing. 

5 - How to address methods? We must determine where, when and how
they are relevant. We must decide whether or not to test the limits.

6 - How to tailor the measure to each situation in which it is
used?

7 - How to do a comparison of techniques?

Recommendation for immediate implementation
1.0  Constraints 

1.1  Measure  must  be  usable  in  aircraft  cockpit  and  simulator. 

1.2  Measurement  procedure  must  not  impose  secondary  task  performance 
requirements  on  subject. 

2.0  Recommended  measures 

2.1  Heart  rate  and  heart  rate  variability  (HR). 

2.2  EOG:  eye  blink  and  eye  movements. 

2.3  Perhaps  aural  canal  temperature. 

3.0  Rationale  for  choice  of  measures 

3.1  HR  and  HR  variability 

Reports  by  Mulder  and  Moray  dealing  with  HR  variability  as  measure  of 
mental  effort  and  comments  by  Alan  Roscoe  concerning  HR. 

3.1.1  HR  variability  as  defined  by  Mulder 

- Abstract interbeat interval (IBI) for fixed number of IBIs
(IBI accurate to ± 5 msec). Moray uses 256 to compute.

- Conduct Lagrange interpolation for every 500 msec.

- Autocorrelate these equidistant values with maximum lag of 10% of
computed values.

- Fourier transform autocorrelation functions.

- Smooth raw spectral densities with Hamming window.

- Derive power of five spectral points (.06, .08, .10, .12, .14 Hz).

- Derive natural log of power of the five points.

- Activity in .06-.14 Hz is sensitive to "mental effort" manipulation.
Greater energy reflects relaxation, lower energy mental effort.
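
The steps listed above can be sketched in code. The Python below is only an illustration under stated assumptions, not Mulder's or Moray's program: linear interpolation stands in for the Lagrange interpolation, the Hamming smoothing uses the common three-point weights (.23, .54, .23), the maximum lag is taken as 10% of the number of resampled values, and every function name is hypothetical.

```python
import math

def resample_ibi(ibi_ms, dt=0.5):
    """Resample an IBI series (msec) onto an equidistant grid of dt seconds.
    ASSUMPTION: linear interpolation replaces the Lagrange interpolation."""
    beat_times, t = [], 0.0
    for ibi in ibi_ms:
        t += ibi / 1000.0
        beat_times.append(t)
    count = int((beat_times[-1] - beat_times[0]) / dt) + 1
    values, j = [], 0
    for i in range(count):
        g = beat_times[0] + i * dt
        while j < len(beat_times) - 2 and beat_times[j + 1] < g:
            j += 1
        frac = (g - beat_times[j]) / (beat_times[j + 1] - beat_times[j])
        values.append(ibi_ms[j] + frac * (ibi_ms[j + 1] - ibi_ms[j]))
    return values

def mulder_spectrum(ibi_ms, dt=0.5, max_lag_fraction=0.10):
    """Return (frequencies in Hz, smoothed spectral densities)."""
    x = resample_ibi(ibi_ms, dt)
    n = len(x)
    mean = sum(x) / n
    x = [v - mean for v in x]
    m = max(2, round(max_lag_fraction * n))          # maximum lag
    # Autocovariance for lags 0..m.
    acov = [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(m + 1)]
    # Cosine (Fourier) transform of the autocovariance function.
    raw = []
    for k in range(m + 1):
        s = acov[0] + acov[m] * math.cos(math.pi * k)
        s += 2.0 * sum(acov[j] * math.cos(math.pi * k * j / m)
                       for j in range(1, m))
        raw.append(2.0 * dt * s)
    # Three-point Hamming smoothing of the raw spectral densities.
    smooth = [0.54 * raw[0] + 0.46 * raw[1]]
    smooth += [0.23 * raw[k - 1] + 0.54 * raw[k] + 0.23 * raw[k + 1]
               for k in range(1, m)]
    smooth.append(0.54 * raw[m] + 0.46 * raw[m - 1])
    freqs = [k / (2.0 * m * dt) for k in range(m + 1)]
    return freqs, smooth

def log_band_power(ibi_ms, targets=(0.06, 0.08, 0.10, 0.12, 0.14)):
    """Natural log of power at the spectral point nearest each target (Hz)."""
    freqs, power = mulder_spectrum(ibi_ms)
    result = {}
    for f in targets:
        k = min(range(len(freqs)), key=lambda i: abs(freqs[i] - f))
        result[f] = math.log(max(power[k], 1e-12))   # guard against log(0)
    return result
```

Under these assumptions the spectral points fall every 1/(2 x maximum lag x 0.5 sec) Hz, so a spacing of .02 Hz needs 50 lags, that is, 500 resampled values or 250 seconds of data.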




3.1.2  Comments  about  procedure  from  principal  advocate(s) 

- Moray reported that a variant of this procedure is currently
in use in his laboratory and works to index mental work load
in a variety of situations. He also reported that although
he used the Mulder algorithm, he could not replicate Mulder's
results in earlier investigations. The cause of this
inability to replicate is a mystery. Other researchers have
experienced and reported negative results.

- Alan Roscoe reviewed the utility of HR as a measure of
"arousal," "attention," or "alertness." This measure should
be useful in long duration flights. In the present context
(30 minute flights), it may not hold much promise, although
one would expect that the two failure conditions may well be
associated with immediate increases in HR. What will be
difficult to determine is whether such increases are
associated with the psychological threat produced by the
equipment failure or an increase in motor activity associated
with dealing with the problem.

In any case, HR variability cannot be measured without
acquiring HR information; therefore, HR remains as a measure in
our proposed "battery."

3.1.3 Potential problem with respect to part-task simulation run as
described.

-  Some  of  the  segments  are  only  1  or  1.5  min  in  duration. 

This  may  be  too  short  a  time  period  to  obtain  reliable 
spectral  information. 

3.1.4  Solutions 

-  Don't  analyze  data  for  segments  less  than  two  minutes  (or 
250  seconds)  with  Mulder  procedure. 

-  Combine  data  from  equal  work  load  segments? 

-  Sample  data  for  longer  periods,  if  at  all  possible. 

-  Await  Moray's  "on  line"  filter  technique  to  analyze  data. 

- Try V̂ technique for vagal tone - Porges black box.

3.1.5  The  recording  of  heart  rate  or  period  poses  no  major 
problem.  Such  measures  have  been  successfully  recorded 
both  in  the  simulator  and  under  flight  conditions. 




3.2 EOG - Eye blink and eye movements


- Recommendations based on results of AAMRL/HEG (Wilson), USAFSAM/VNE
(Miller), and WUBRL/MBMHC (Stern). Measure has been used in
laboratory and simulator, and is currently being used for in-flight
monitoring (AAMRL).

3.2.1  Blink  measures  utilized 

-  Blink  rate  obtained  results:  differences  in  rate  between 
pilot  in  command  of  aircraft  and  second  in  command. 

-  Increase  in  blink  rate  as  a  function  of  time-on-task  have 
been  demonstrated  in  a  variety  of  conditions. 

- Inhibition of blinking is related to visual task
difficulty.

- Blink amplitude: not very discriminating for effort at
hand. USAFSAM data show smaller amplitude blinks being
associated with poorer performance (probably a function of
partial lid closure as sleep-deprived subjects perform for a
long duration (4 1/2 hours) in the simulator; Morris data).

- Blink closing and/or closure duration (50% window):

Closing  duration  is  defined  as  time  from  blink  initiation 
to  peak  closure.  Closure  duration  is  defined  as  time 
between  the  blink  entering  a  window  defined  by  half 
amplitude  of  blink  and  exiting  that  window.  Closure 
duration  has  been  shown  to  be  sensitive  to  visual  task 
demands  (work  load),  with  shorter  closure  duration  blinks 
associated  with  more  demanding  tasks  (demonstrated  in 
laboratory  and  simulation  settings). 
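
The two duration definitions above can be illustrated with a short sketch. It is a minimal example assuming a single, already isolated blink in a baseline-referenced EOG trace; the function name, the use of the first sample as baseline, and the 5% onset criterion for "blink initiation" are assumptions made here for illustration, not part of the panel's procedure.

```python
def blink_durations(samples, sample_rate_hz, onset_fraction=0.05):
    """Closing and closure duration (seconds) for one isolated blink.

    samples: EOG amplitudes, larger = more lid closure (ASSUMPTION).
    onset_fraction: fraction of peak amplitude used to mark blink
    initiation (ASSUMPTION; the source does not define initiation).
    """
    baseline = samples[0]
    peak_index = max(range(len(samples)), key=lambda i: samples[i])
    amplitude = samples[peak_index] - baseline
    if amplitude <= 0:
        raise ValueError("no blink deflection above baseline")
    onset_level = baseline + onset_fraction * amplitude
    half_level = baseline + 0.5 * amplitude          # the 50% window
    # Blink initiation: first sample rising above the onset level.
    onset = next(i for i in range(peak_index + 1)
                 if samples[i] > onset_level)
    # Entering the window: first sample at or above half amplitude.
    enter = next(i for i in range(peak_index + 1)
                 if samples[i] >= half_level)
    # Exiting the window: first sample after the peak below half amplitude.
    leave = next((i for i in range(peak_index, len(samples))
                  if samples[i] < half_level), len(samples))
    dt = 1.0 / sample_rate_hz
    return {
        "closing_duration_s": (peak_index - onset) * dt,
        "closure_duration_s": (leave - enter) * dt,
    }
```

A shorter closure duration from such a routine would correspond to the more demanding visual tasks described above.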

3.2.2 Comments about procedures from principal advocates (Wilson,
Stern)

-  Data  can  be  collected  in  simulator  and  in  flight.  Appears 
to  be  sensitive  to  the  type  of  work  load  manipulations 
proposed  in  McDonnell-Douglas/Boeing  joint  effort. 

3.2.3  Potential  problems 

- Duration over which data is sampled for various work load
levels is at the lower end of acceptability. One minute
of data is too short to produce reliable rate data; may
be acceptable for other blink measures.


- Since eye position information will be advocated as a
promising measure for future implementation, I (not the
working group) would like to recommend the recording of
horizontal eye movement utilizing EOG procedures. If,
as suggested by Moray, alterations in dwell time on
particular instruments may index visual information
abstraction inefficiencies, then fixation pause durations
as evaluated with EOG may provide a reasonable (and
inexpensive) way of obtaining such (or similar) information.

3.3  Intra-aural  temperature 

- Recommendation based on comments by Peter Hancock. He has been using
this measure at NASA-Ames in their simulators.

- Possible measures advocated were: absolute temperature and
temperature change measured from one ear; and differential temperature
between the ears.

3.3.1 Comments from principal advocate

- The measure reflects brain (perhaps hypothalamic)
temperature. To the extent that metabolic "need" produces
enhanced blood flow in the brain, this technique allows
for the evaluation of such changes in metabolic activity.

-  It  is  information  that  is  inexpensively  acquired,  and 
Hancock  will  make  his  device  available  to  the  project. 

3.3.2  Comments  from  panel  participants 

- We did not have available relevant resource material to
evaluate the claims by Hancock. He will provide Douglas
with reprints relevant to the use of this procedure in the
evaluation of work load.

- There were comments about technological problems such as
positioning of sensor so that it picked up tympanic
membrane temperature rather than skin temperature.

- Technical problems associated with the headset worn by the
pilot.

3.3.3 Potential problems

- Some are listed above.

- The panel was [illegible] endorsement of this measure
because [illegible] information available to us concerning
work load assessment in general and flight simulation
[illegible].

4.0 Recommendations for the NEAR future: Measures recommended for
consideration

4.1 Event related brain potentials (ERPs)

4.2  Eye:  point-of-regard  information 

4.3  Voice  analysis 
Rationale  for  choice  of  measures 

4.1.1 Event related potentials (Principal proponents - Kramer,
Stern)

- The University of Illinois Cognitive Psychophysiology
Laboratory, in conjunction with aviation psychology
(Wickens), has extensive experience with the use of this
technology to evaluate aspects of work load as well as
cognition. The University of Illinois investigators have
focused on the use of ERPs to both primary (embedded) and
secondary task stimuli, while the Washington University
effort has begun to investigate the use of "irrelevant
probe stimuli." All three appear to be promising with
respect to simulator applications, and the embedded and
probe stimuli procedures should be usable in flight
environments.

- The use of embedded signals (signals that are part of
the primary task) for the triggering of ERPs was
recommended as an alternative to using secondary tasks.
Other suggestions involved the use of "irrelevant"
probe stimuli to elicit ERPs and the use of subject-produced
"responses," such as saccadic eye movements, to
trigger the averaging process.

- A fall-out from recording ERPs is the EEG. Spectral
analysis of this data was suggested as another window to
capture levels of arousal or alertness (the other window
being HR). Since alertness should not be a problem in the
proposed simulation, this idea was not explored further.

- The technology for recording EEG (including ERPs) in
simulators and in flight is available and being used
(ERPs: University of Illinois, Kramer, and AAMRL/HEG, Wilson;
EEG in-flight recording: B. Sterman; centrifuge:
USAFSAM/Lewis). ERPs have been demonstrated to be
sensitive to task demand or work load effects.

4.1.2  Problems 

- Signal/noise ratio: signal of interest is frequently
degraded and lost during flight and simulation conditions.

4.1.3 Solutions

- Solutions, such as special electrodes, amplifying the
signal at the source, and development of filtering
procedures for signal "sensitizing," are in reasonably advanced
stages of development.

4.2  Eye  point-of-regard  -  Principal  proponent  -  Moray. 

- There is considerable literature demonstrating the utility of this
technique in flight simulation (this material is available in the
bibliography provided). As an example of applying this technique
to work load assessment, Moray noted that dwell time on
instruments is generally on the order of 500 msec. If the eye
returns to the same instrument frequently (i.e., average dwell
time is decreased), it might suggest that the person's ability to
retain information abstracted in short term memory is impaired.
This might be considered an indicant of high work load.

- With respect to using the technique in flight, Moray expressed
optimism about the possibility of developing the necessary
instrumentation to record eye position information under such
conditions. However, such instrumentation is not currently
available.
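
Moray's dwell-time example can be made concrete with a small sketch. It is illustrative only and assumes a hypothetical input format: a time-ordered list of (instrument, duration in msec) fixations already assigned to instruments. The function name and the instrument labels in any usage are likewise assumptions.

```python
from collections import defaultdict

def dwell_statistics(fixations):
    """Visits and mean dwell time (msec) per instrument.

    fixations: time-ordered (instrument, duration_ms) pairs.
    Consecutive fixations on the same instrument form one dwell.
    """
    dwells = []
    for instrument, duration_ms in fixations:
        if dwells and dwells[-1][0] == instrument:
            dwells[-1][1] += duration_ms      # same dwell continues
        else:
            dwells.append([instrument, duration_ms])
    totals = defaultdict(lambda: [0, 0.0])    # instrument -> [visits, msec]
    for instrument, duration_ms in dwells:
        totals[instrument][0] += 1
        totals[instrument][1] += duration_ms
    return {name: {"visits": visits, "mean_dwell_ms": total / visits}
            for name, (visits, total) in totals.items()}
```

Frequent returns to one instrument with a falling mean dwell time would be the pattern Moray suggested as a possible indicant of high work load.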

4.3  Voice  analysis  (Principal  proponent  -  Walrath) 

-  Little  comment,  other  than  that  it  might  be  a  useful  procedure 
for  the  future.  There  was  some  discussion  about  its  utility  to 
evaluate  "stress,"  but  few  comments  relevant  to  how  it  might  be 
used  to  evaluate  aspects  of  work  load.  Several  articles 
describing  laboratory  validation  were  entered  in  the  reference 
data base. Current work, sponsored by NASA/Langley, will be
monitored. Since pilots engage in voice communication, using
voice  output  would  be  the  least  intrusive  measure  of  all  those 
sampled  by  us. 

5.0 What physiological measures contribute to the assessment of work load

5.1 They provide some measures that cannot otherwise be obtained
(unless one can impose secondary tasks), measures that will be
most useful in the evaluation of conditions involving underload
of the operator. The states of concern mentioned include
"arousal," "alertness," "attention," "daydreaming," "drowsiness,"
and "microsleep," to mention but a few.


5.2 They provide information about moment-to-moment changes in the
operator, rather than average values (averaged over time). Thus,
points (or narrow windows) of "momentary" overload can be
identified.

5.3 They are unobtrusive and objective.

5.4  They  should  be  used  to  complement  other  measures.  As  amply 
demonstrated  during  the  general  presentations  and  discussions, 
subjective,  performance,  and  physiological  measures  are  far  from 
perfectly  correlated  with  each  other.  They,  therefore,  provide 
complementary  (not  competitive)  bits  of  information  to  those 
concerned  with  the  evaluation  of  mental  work  load. 

5.5  They  have  the  potential  for  allowing  us  to  discriminate  between 
"controlled"  and  "automatic"  information  processing  by  the 
operator. 

6.0 Sensitivity

The issue of sensitivity of physiological measures to graded changes in
(or levels of) workload was not discussed. Our discussion focused
principally on looking for differences between resting and "load"
conditions. The literature, at best, has attempted to discriminate
between three levels of work load (low, medium, and high).

A  number  of  reasons  (rationalizations)  can  be  used  to  account  for  the 
relative  lack  of  effort  in  this  important  area.  The  first  is  our 
difficulty  in  establishing  a  metric,  other  than  an  ordinal  one,  for 
defining  work  load  levels.  The  second  is  the  lack  of  a  substantial  data 
base  relating  physiological  measures  to  work  load  that  unequivocally 
demonstrates  the  utility  of  such  a  measure  for  work  load  assessment.  The 
field  is  still  young! 

A third is the hope that physiological measures will provide the metric
for defining work load levels with greater precision than currently
possible. We should point out that our definition of "work load" may be
radically different from that of the Human Factors Engineer. It is our
suspicion that their preferred definition deals principally with the load
imposed on an operator by a given configuration of hardware. Our
preferred definition is in terms of the impact of the imposed load on an
operator. Thus, the point in the information processing chain that is
sampled by those who attempt to define work load on the basis of what is
imposed on the operator is different from that sampled by those who focus
on the impact of the imposed load on aspects of operator performance. In
our case, the performance measure is the output from one or more
physiological systems.

7.0 Warning comments

7.1 At the current stage of development, those enlisted to collect
physiological data must be trained: to discriminate signal from
noise; in the proper application of electrodes; and in
appropriate signal conditioning procedures (amplification,
filtering, etc.).

7.2 Environments in which bio-electric data is to be collected have
to be "sanitized," i.e., sources of electrical artifact have to be
shielded, equipment properly grounded, etc. This may require the
services of a biomedical engineer.

7.3  Data  reduction  is,  at  best,  a  semi-automatic  process. 

7.4 Data collection and reduction is a relatively costly procedure,
when compared to paper and pencil tests or recording the outputs
from "manipulanda" or in-flight equipment.

[Figure: approach profile with measurement windows. Recoverable labels:
Outer Marker; Middle Marker; 5,500 ft; FLAPS 5; WINDOWS; CLOSE
MEASUREMENT WINDOW; 1:00 (one minute later); 1:30 (one and a half
minutes later); 2:00 (two minutes later); 2:30 (two and a half minutes
later).]

[Scenario task timeline. Handwritten margin codes are illegible.]

          * SEES ALERT LIGHT ILLUMINATE.
          ENGINE [illegible].
          * HEARS F/E, "NO. 3 ENGINE IS SURGING."
          * NOTES #3 EPR, N1 AND N2 FLUCTUATING.
          * NOTES #3 EGT RISING.
          * CAPT TELLS F/O, "BRING #3 TO IDLE."
          * CAPT TRIMS RUDDER AND STABILIZER.
          * NOTES F/O BRINGING #3 THROTTLE TO IDLE.
          * HEARS F/E, "#3 IS STILL SURGING. WE APPEAR TO BE GETTING
            COMPRESSOR STALLS."
          * CAPT CALLS, "ROGER, LET'S SHUT IT DOWN. ENGINE SHUTDOWN
            CHECKLIST."
          * CAPT ADDS POWER, TRIMS RUDDER AND STABILIZER.

00:09:30  ENGINE FAILURE/SHUTDOWN CHECKLIST.
          * HEARS F/E, "ESSENTIAL POWER OPERATING GENERATOR."
          * HEARS F/E, "THRUST LEVER - CLOSED."
          * CAPT CLOSES NO. 3 THRUST LEVER.
          * HEARS F/E, "START LEVER - CUTOFF."
          * CAPT PLACES START LEVER TO CUTOFF.
          * F/E SAYS, "CARGO OUTFLOW VALVE - CLOSED."
          * F/E SAYS, "GENERATOR BREAKER LIGHT - ON."
          * F/E SAYS, "ELECTRICAL LOADS - MONITOR."
          * F/E SAYS, "ENGINE BLEED AIR SWITCH - CLOSED."
          * F/E SAYS, "WING AND ENGINE ANTI-ICE SWITCHES - CLOSED."
          * HEARS F/O RESPOND, "CLOSED."
          * HEARS F/E, "ENGINE SHUTDOWN CHECKLIST COMPLETE."

00:10:00  LEVEL OFF AT 11,000 FT AND CROSS RE3AS INTERSECTION.
          * SEES ALTITUDE ALERT LIGHT EXTINGUISH.
          * CAPT TELLS F/E TO SET CRUISE THRUST FOR ENGINE INOP CRUISE.
          * HEARS F/E, "POWER SET."
          * CAPT TRIMS RUDDER AND STABILIZER FOR CRUISE.
          * HEARS F/E, "ONE GENERATOR INOP CHECKLIST IS NOW COMPLETE."
          * CAPT ACKNOWLEDGES.
          * LEVELS AIRPLANE AT 11,000 FEET USING ADI, ALTIMETER, AND
            VERTICAL SPEED INDICATOR.
          * PLACES [illegible] AUTOPILOT ALTITUDE HOLD SWITCH ON.
          * ACCELERATES TO [illegible] KNOTS. (2 ENGINE CRUISE)
          * CALLS, "LANDING LIGHTS OFF."
          * SETS AIRSPEED CURSOR TO [illegible] KNOTS.
          * OBSERVES APPROACHING DESIRED AIRSPEED.
          * [illegible] THRUST TO MAINTAIN DESIRED AIRSPEED.
          * OBSERVES DISTANCE TO MODESTO ON DME INDICATOR.
          * OBSERVES [illegible] DEG COURSE INTERCEPT TO MODESTO VORTAC.
          * INITIATES RIGHT TURN ON HSI AS COURSE DEVIATION INDICATOR
            APPROACHES CENTER.


PHYSICAL / MENTAL

                    LOW                        HIGH
PHASE       SFO-SCK     SMF-SFO       SFO-SCK     SMF-SFO     HIGH CONDITION

T/O         75 / 124    71 / 117      75 / 124    71 / 117
CLIMB       39 / 75     38 / 78       40 / 82     39 / 95
CRUISE1     13 / 46     10 / 37       34 / 128    31 / 121    AUTOPILOT INOP
CRUISE2      1 / 6       1 / 5        11 / 50     11 / 49     NO. 3 ENGINE COMPRESSOR STALL
DESCENT     47 / 137    45 / 141      46 / 140    53 / 164    B SYSTEM HYDRAULICS FAILURE
APPROACH    63 / 185    66 / 199      61 / 177    64 / 190
LAND        28 / 62     28 / 62       26 / 59     26 / 59

TOTALS     266 / 635   259 / 637     293 / 760   295 / 795

EXPERIMENTAL VARIABLES

4 FACTORS: 2 X 2 X 4 X 7 DESIGN

FACTOR ONE:   WORKLOAD (2 levels) LOW and HIGH
FACTOR TWO:   ROUTE (2 levels) SFO - SCK & SMF - SFO
FACTOR THREE: FOUR POSSIBLE ORDERS A, B, C, & D
FACTOR FOUR:  MEASUREMENT EPOCH (TASK LOADINGS) (7 levels)

1) Takeoff

2) Climb

3) Cruise 1

4) Cruise 2

5) Descent

6) Approach

7) Landing
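
As a bookkeeping aid, the cells of the 2 X 2 X 4 X 7 design can be enumerated mechanically. The sketch below is illustrative only; the level labels are taken from the slide as read here, and the function and field names are hypothetical.

```python
from itertools import product

WORKLOAD = ("LOW", "HIGH")
ROUTE = ("SFO-SCK", "SMF-SFO")
ORDER = ("A", "B", "C", "D")
EPOCH = ("Takeoff", "Climb", "Cruise 1", "Cruise 2",
         "Descent", "Approach", "Landing")

def design_cells():
    """Return one record per cell of the workload x route x order x epoch design."""
    return [
        {"workload": w, "route": r, "order": o, "epoch": e}
        for w, r, o, e in product(WORKLOAD, ROUTE, ORDER, EPOCH)
    ]
```

Enumerating the cells this way gives 2 x 2 x 4 x 7 = 112 records, which is convenient for checking that every combination has data before analysis.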




DATA ANALYSES

SENSITIVITY ANALYSES

CONTENT VALIDITY


2 X 7 ANOVA Workload by Measurement Epoch (TASK LOADINGS)


o Various Dependent Variables, one at a time


o MANOVA approach presupposes a battery of tests


Then a series of planned comparisons:


Considerations: o Maintain comparison-wise Type I error rate
                o Select a test which is ROBUST to departures
                  from equal variances and to unequal sample sizes
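
The slide does not name a particular test. One familiar statistic with the robustness property it asks for is Welch's t, sketched below purely as an illustration; it treats the two samples as independent, so whether it suits the within-subjects comparisons planned here is a separate question. The function name is hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Does not assume equal variances or equal sample sizes.
    ASSUMPTION: the two samples are independent.
    """
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    va = variance(sample_a) / na      # squared standard error, sample A
    vb = variance(sample_b) / nb      # squared standard error, sample B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

The returned t and degrees of freedom would then be referred to a t table at the chosen comparison-wise Type I error rate.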


o SENSITIVITY TO DIFFERENT PHASES OF FLIGHT


Within a Flight Segment (i.e., SFO - SCK) compare different
Measurement Epochs (TASK LOADINGS)

SFO - SCK Takeoff (LOW)
to
SFO - SCK Cruise 1 (LOW)


QUESTION: "Can the workload measure discriminate between different
phases of the same flight segment?"


DATA ANALYSES

CONTENT VALIDITY


o SENSITIVITY TO DIFFERENT FLIGHT SEGMENTS (HIGH AND LOW TASK LOADINGS)

Between Flight Segments (i.e., LOW & HIGH) compare different
Levels of Workload

SFO - SCK Descent (LOW)
to
SFO - SCK Descent (HIGH)


QUESTION: "Can the workload measure discriminate between the same phase
of the different flight segments?"


DATA ANALYSES

ALTERNATE FORMS / TEST-RETEST RELIABILITY

o SESSION ONE TO SESSION TWO

Collapsed across Flight Segments, compare different Measurement
Epochs (TASK LOADINGS)

SFO - SCK Descent (LOW)
to
SMF - SFO Descent (LOW)


QUESTION: "Is the measure stable? Will the same relative differences in
workload be found with repeated testing?"


DATA ANALYSES

CONSTRUCT VALIDITY

o CORRELATION COEFFICIENTS COMPARING VARIOUS WORKLOAD MEASURES

Collapsed across Measurement Epochs (TASK LOADINGS), compare the same
Flight Segments

SFO - SCK (HIGH)
Physiological Measure #1
to
SFO - SCK (HIGH)
Subjective Measure #1


QUESTION: "Are different workload measures sensitive to the same
variations in TASK DEMANDS?"
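
The slide does not say which correlation coefficient was intended. As a minimal sketch, assuming the ordinary Pearson product-moment coefficient and paired scores from two measures on the same flight segments, the computation looks like this (the function name is hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between paired scores x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples of at least 2 pairs")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

A coefficient near 1 between, say, a physiological and a subjective measure would suggest the two respond to the same variations in task demands; a low value would suggest they capture different aspects.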


Simulation  Facility 


Subjects 


Two  Workload  Levels 


failure,  window  overheat) 


Operationally  Relevant  Types  of  Workload 

FAR 25.1523, Appendix D

Basic Workload Functions


6. Command decisions


Workload Factors

FAR-25

10. Incapacitated crew member


Function and Factor Mapping


pitch attitude

Cleared Direct Sacramento Vortac


Workload Factors

Task Timeline Analysis (TLA)

ASSESSMENT OF CREW
WORKLOAD WORKSHOP SURVEY


NAME 


MEASURE  PREFERENCES 

Because  your  views  may  be  different  from  those  already  presented  by  the 
speakers,  or,  from  those  presented  from  your  working  group,  we  are  interested 
in  what  you  believe  to  be  the  best  workload  measures.  If  your  views  are 
different  from  those  already  presented,  please  indicate  which  measures  you 
believe  are  best,  in  order,  and  provide  a  rationale  for  your  view. 


Measure  Type:  Subjective  Physiological  Performance 


Measure  Rationale 


SIMULATION  RECOMMENDATIONS 

How  would  you  improve  the  design  of  the  part-task  Simulation? 


SUGGESTED MEASURES


Questionnaire Summary

GROUP             MEASURE                                 COUNT

Physiological     Voice Measures                            3
Physiological     Auditory Canal Temperature                2
Physiological     ERP                                       2
Physiological     EEG Spectra                               1
Physiological     Eye Movement                              1
Subjective        Modified Cooper-Harper                    2
Subjective        Airbus 2-8 Scale                          1
Subjective        Bedford Scale                             1
Subjective        Boeing Comparative                        1
Subjective        In Flight, High Level Subjective          1
Perform. Sec.     Time Estimation                           2
Perform. Sec.     Embedded Communication                    1
Perform. Sec.     Unobtrusive Embedded Secondary Task       1
Perform. Prim.    Primary Task (mission effectiveness)      1
Other             Performance/Physiological                 1
Other             SSER                                      1
Other             Subjective/Physiological                  1
Other             Timeline Analysis                         1


Questionnaire  Summary 


SIMULATION SUGGESTIONS


SUGGESTED ADDITIONS:

o Increase the amount of mechanical failures (switch failure to
hold, CB fails).

o Pre-correlate workload with timeline analysis results to confirm
validity.

o Use glideslope as a primary task measure for AZ, ALT deviation,
power required (weighted differently across the scenario).

o Include a very high workload condition to be sure the measures are
measuring anything.

o Use filtering or other 0.1 Hz analysis.

o Use NDB approach or LDA approach to load the captain.

o Address crew members as captain, first officer, and second officer.

o Manipulate communications workload.

o Perform an a priori test for level factor as a 2 X 2 (3 treatments
with 1 within subjects factor) MANOVA. If there is no effect, it
can be eliminated as a factor for later analysis.

o Use synthetic tasks early on, then embedded tasks later in full
system flight tests.

o Design a new simulator with built-in data systems and flexible,
quick-change capabilities.

o Design criterion tasks with built-in measurement schemes.

o Measure co-pilot workload.

o Test measures in an actual 727 flight at some time.

o Include more ECG data in 1.0, 1.7, and 1.5 minute segments for FFT
analysis.

o Make windows longer to allow heart rate variability to stabilize.


SUGGESTED  ELIMINATIONS: 

o  Eliminate  throttle  reversal  (measures  technique,  not  workload). 

o  "Improve  the  scenario" 

o  Eliminate  2  workload  states. 

o  Do  not  use  727  for  testing.  Old  model,  will  probably  never  be 
certified  again.  Can  you  be  certain  of  validity  for  a  highly 
automatic  plane? 

o  Eliminate  compressor  stall  at  10,000  feet  (unrealistic,  should  be 
at  a  higher  altitude). 

o  Do  not  assume  equal  workload  at  the  3  airports  selected. 


SUGGESTED CONSIDERATIONS:

o Asymmetric performance transfer

o Range effects X stress differential interaction

o New generation A/C

o Full crew interaction

o Sensitivity of measures

o Between subjects and between test run variability

o Anticipation between high and low levels

o What part-task simulation?

o Mechanical malfunctions load second officer more than captain.


Questionnaire  Summary 


SUGGESTED REFERENCES


IN REFERENCE TO MEASURES:
o Speyer's handouts [Speyer]
o Otis Elevator (c. 1965-1970) [Parks]

IN REFERENCE TO SCENARIO:
o Poulton and Freeman (1973) [Hancock]
o WPAFB, Human Factors Branch, Crew Station Designs Division c. 1979
(call Richard Gesselhart (513) 255-4109) [Metcher]
o O'Donnell and Eggemeier; Gopher and Donchin (current) [Derrick]


